[SPARK-19727][SQL] Fix for round function that modifies original column #17075
wojtek-szymanski wants to merge 7 commits into apache:master
Conversation
I don't know the code well enough to really evaluate this. CC maybe @cloud-fan or @yjshen?

I think we should fix

Good idea @cloud-fan. I will look for usages of

I have just started refactoring. Here are my first thoughts:
I would be grateful if you could confirm if it's the right way to go.

how about we add a new method

It seems it makes more sense now, please have a look.
  case StringType =>
    buildCast[UTF8String](_, s => try {
-     changePrecision(Decimal(new JavaBigDecimal(s.toString)), target)
+     toPrecision(Decimal(new JavaBigDecimal(s.toString)), target)
Looks like here we don't need to create a new instance?
-     buildCast[Long](_, t => changePrecision(Decimal(timestampToDouble(t)), target))
+     buildCast[Long](_, t => toPrecision(Decimal(timestampToDouble(t)), target))
  case dt: DecimalType =>
    b => changePrecision(b.asInstanceOf[Decimal].clone(), target)
I think this is the only case we need toPrecision

Nope, there is one more here:
case BooleanType =>
  buildCast[Boolean](_, b => toPrecision(if (b) Decimal.ONE else Decimal.ZERO, target))
Both ONE and ZERO are singletons, so changing precision on them in place is not a good idea.
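Since the singleton argument is central to the thread, here is a toy sketch (plain Scala; MutableDec and its methods are illustrative, not Spark's actual Decimal class) of why an in-place changePrecision on a shared singleton is dangerous, while a copy-based toPrecision is safe:

```scala
// Toy model of a mutable decimal: changePrecision mutates in place,
// toPrecision copies first. NOT Spark's real Decimal implementation.
class MutableDec(var unscaled: Long, var scale: Int) {
  def changePrecision(newScale: Int): Boolean = {
    while (scale < newScale) { unscaled *= 10; scale += 1 }
    true
  }
  def toPrecision(newScale: Int): Option[MutableDec] = {
    val copy = new MutableDec(unscaled, scale) // defensive copy
    if (copy.changePrecision(newScale)) Some(copy) else None
  }
}

object MutableDec {
  val ONE = new MutableDec(1, 0) // shared singleton, like Decimal.ONE
}

// Calling MutableDec.ONE.changePrecision(2) would silently turn the
// singleton into 1.00 for every other user. toPrecision does not:
val widened = MutableDec.ONE.toPrecision(2).get
assert(widened.unscaled == 100 && widened.scale == 2)
assert(MutableDec.ONE.unscaled == 1 && MutableDec.ONE.scale == 0)
```

The defensive copy is what keeps Decimal.ONE and Decimal.ZERO usable after a cast.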
 * Create new `Decimal` with given precision and scale.
 *
 * @return `Some(decimal)` if successful or `None` if overflow would occur
 */
nit: the style should be
def xxx(
    para1: xxx,
    para2: xxx): T = {

Fixed, thanks
val value = this.clone()
value.changePrecision(
  DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR)
value
shall we assume toPrecision will always return Some here?

Theoretically, it should be Some. On the other hand, if something goes wrong when setting the new precision in floor or ceil, I would raise an internal error:
def floor: Decimal = if (scale == 0) this else {
  val newPrecision = DecimalType.bounded(precision - scale + 1, 0).precision
  toPrecision(newPrecision, 0, ROUND_FLOOR).getOrElse(
    throw new AnalysisException(s"Overflow when setting precision to $newPrecision"))
}
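The contract discussed here, toPrecision returning an Option and floor treating None as an internal error, can be sketched in plain Scala (toPrecisionSketch and floorSketch are illustrative names; Spark's real Decimal stores values differently):

```scala
import scala.math.BigDecimal.RoundingMode

// Sketch of the Option-returning contract: Some(rounded) if the result
// fits in `precision` digits, None on overflow.
def toPrecisionSketch(v: BigDecimal, precision: Int, scale: Int): Option[BigDecimal] = {
  val rounded = v.setScale(scale, RoundingMode.FLOOR)
  if (rounded.precision <= precision) Some(rounded) else None
}

// floor never overflows: the integer part of a (precision, scale) decimal
// fits in precision - scale + 1 digits, so None here is an internal error.
def floorSketch(v: BigDecimal): BigDecimal = {
  val newPrecision = v.precision - v.scale + 1
  toPrecisionSketch(v, newPrecision, 0).getOrElse(
    throw new IllegalStateException(s"Overflow when setting precision to $newPrecision"))
}

assert(floorSketch(BigDecimal("5.9")) == BigDecimal(5))
assert(floorSketch(BigDecimal("-5.9")) == BigDecimal(-6))
```

This mirrors the proposal above: the caller that "knows" overflow is impossible unwraps the Option and surfaces any violation as an error rather than mutating in place.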
}
test("round/bround with data frame from a local Seq of Product") {
  val df = spark.createDataFrame(Seq(NumericRow(BigDecimal("5.9"))))
we don't need to create NumericRow, try Seq(BigDecimal("5.9")).toDF("value")

Actually, the problem occurs only when creating a data frame from a Product. I was unable to reproduce the issue with Seq(BigDecimal("5.9")).toDF("value").
this is weird, can you look into it? spark.createDataset(Seq(BigDecimal("5.9"))) should produce the same result.

Sure, I will try to investigate where the difference is.

The fundamental difference is in the underlying row type assigned to the dataframe/dataset. A dataframe is based on GenericInternalRow, while a Dataset uses UnsafeRow. During evaluation of the round expression, the method getDecimal is called on a row, see BoundAttribute.scala#L52. As a result, GenericInternalRow returns just an element of an array, which points to the reference of the original column, see rows.scala#L200. The strategy used in UnsafeRow is completely different: a new decimal instance is created, see UnsafeRow.java#L399.
I hope this helps to explain why only the dataframe is affected.
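The aliasing difference described above can be illustrated with a minimal sketch (plain Scala; Cell, AliasingRow, and CopyingRow are illustrative stand-ins, not Spark's actual row classes):

```scala
// A mutable cell standing in for Spark's mutable Decimal.
final class Cell(var value: Int)

// Like GenericInternalRow: hands out the stored reference itself.
class AliasingRow(values: Array[Cell]) {
  def getCell(i: Int): Cell = values(i)
}

// Like UnsafeRow's getDecimal: materializes a fresh object on every read.
class CopyingRow(values: Array[Cell]) {
  def getCell(i: Int): Cell = new Cell(values(i).value)
}

val backingA = Array(new Cell(5))
new AliasingRow(backingA).getCell(0).value = 6 // mutates the original column!
assert(backingA(0).value == 6)

val backingB = Array(new Cell(5))
new CopyingRow(backingB).getCell(0).value = 6  // only the copy changes
assert(backingB(0).value == 5)
```

With the aliasing row, any expression that mutates the value it reads corrupts the backing data, which is exactly why round on a GenericInternalRow-backed dataframe modified the original column.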
- value.changePrecision(
-   DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR)
- value
+ toPrecision(DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR)
This might end up creating two copies of the object in the worst case:
- once in toPrecision
- a second time on this line

The old logic would guarantee a single copy.

You're right, thanks. My suggestion is to raise an internal error if setting the new precision in floor or ceil would fail.
- value.changePrecision(
-   DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_CEILING)
- value
+ toPrecision(DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_CEILING)
See my comment above
}
- test("changePrecision() on compact decimal should respect rounding mode") {
+ test("changePrecision/toPrecission on compact decimal should respect rounding mode") {
nit: typo in toPrecission

Thanks, fixed
 *
 * @return `Some(decimal)` if successful or `None` if overflow would occur
 */
private[sql] def toPrecision(
Fixed, thanks
)
checkAnswer(
  df.withColumn("value_rounded", bround('value)),
  Seq(Row(BigDecimal("5.9"), BigDecimal("6")))
The bround function is also affected. Column value_rounded has been renamed to value_brounded.
 * Create new `Decimal` with given precision and scale.
 *
 * @return `Some(decimal)` if successful or `None` if overflow would occur
 */
style:
def xxx(
    para1: xxx,
    para2: xxx): XXX
  checkAnswer(df.selectExpr("positive(b)"), Row(-1))
  }
}
case class NumericRow(value: BigDecimal)
let's just use Tuple1 instead of creating this class

replaced with Tuple1
 *
 * @return `Some(decimal)` if successful or `None` if overflow would occur
 */
private[sql] def toPrecision(
4-space indentation here. Please take a look at other methods in Spark and follow the code style.

Actually I did, but I saw so many different styles that I had no idea which one is correct. Thanks again for your patience.
LGTM, pending tests

@cloud-fan could you please give the green light to tests?

ok to test

sorry, forgot to trigger the test...

Test build #74135 has finished for PR 17075 at commit

retest this please

Test build #74152 has finished for PR 17075 at commit

retest this please

Test build #74181 has finished for PR 17075 at commit

retest this please

Test build #74186 has started for PR 17075 at commit

retest this please

Test build #74195 has finished for PR 17075 at commit

thanks, merging to master!
…ginal column

## What changes were proposed in this pull request?
This is a followup of #17075, to fix the bug in codegen path.

## How was this patch tested?
new regression test

Author: Wenchen Fan <wenchen@databricks.com>
Closes #19576 from cloud-fan/bug.
(cherry picked from commit 7fdacbc)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?
Fix for the SQL round function that modifies the original column when the underlying data frame is created from a local product.
How was this patch tested?
New unit test added to the existing suite org.apache.spark.sql.MathFunctionsSuite.
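The reported bug and the fix can be modeled in a minimal sketch (plain Scala; MutableDecimal is an illustrative wrapper introduced here because java.math.BigDecimal is immutable, and this is not Spark's actual round implementation):

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Stand-in for Spark's mutable Decimal.
class MutableDecimal(private var v: JBigDecimal) {
  def value: JBigDecimal = v
  // Buggy pattern: round in place, mutating whatever row cell holds us.
  def roundInPlace(scale: Int): Unit =
    v = v.setScale(scale, RoundingMode.HALF_UP)
  // Fixed pattern: round a copy, leaving the original column untouched.
  def rounded(scale: Int): MutableDecimal =
    new MutableDecimal(v.setScale(scale, RoundingMode.HALF_UP))
}

val column = new MutableDecimal(new JBigDecimal("5.9"))
val result = column.rounded(0)
assert(result.value.compareTo(new JBigDecimal("6")) == 0)   // rounded value
assert(column.value.compareTo(new JBigDecimal("5.9")) == 0) // original preserved
```

With the in-place variant, the equivalent of withColumn("value_rounded", round('value)) would also overwrite 'value, which is precisely what the new MathFunctionsSuite test guards against.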